NSF PAR Search | NSF Public Access Repository

A map of cis-regulatory modules and constituent transcription factor binding sites in 80% of the mouse genome

https://doi.org/10.1186/s12864-022-08933-7

Ni, Pengyu; Wilson, David; Su, Zhengchang (October 2022, BMC Genomics)

Abstract Background: Mouse is probably the most important model organism for studying mammalian biology and human diseases. A better understanding of the mouse genome will help in understanding the human genome, biology and diseases. However, despite recent progress, the characterization of the regulatory sequences in the mouse genome is still far from complete, limiting its use for understanding the regulatory sequences in the human genome. Results: Here, by integrating binding peaks in ~9,000 transcription factor (TF) ChIP-seq datasets that cover 79.9% of the mouse mappable genome using an efficient pipeline, we were able to partition these binding peak-covered genome regions into a cis-regulatory module (CRM) candidate (CRMC) set and a non-CRMC set. The CRMCs contain 912,197 putative CRMs and 38,554,729 TF binding site (TFBS) islands, covering 55.5% and 24.4% of the mappable genome, respectively. The CRMCs tend to be under strong evolutionary constraints, indicating that they are likely cis-regulatory, while the non-CRMCs are largely selectively neutral, indicating that they are unlikely to be cis-regulatory. Based on evolutionary profiles of the genome positions, we further estimated that 63.8% and 27.4% of the mouse genome might code for CRMs and TFBSs, respectively. Conclusions: Validation using experimental data suggests that at least most of the CRMCs are authentic. Thus, this unprecedentedly comprehensive map of CRMs and TFBSs can be a good resource to guide experimental studies of regulatory genomes in mice and humans.
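
Coverage figures such as "79.9% of the mappable genome" are, in general, computed by taking the union of genomic intervals and dividing by the mappable length. The following is a minimal sketch of that arithmetic only, not the authors' pipeline; the function names, the half-open coordinate convention and the toy coordinates are all assumptions made for illustration.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def coverage_fraction(intervals, mappable_length):
    """Fraction of a region of `mappable_length` bases covered by the union."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / mappable_length


# Toy binding peaks on one hypothetical chromosome (illustrative values only).
peaks = [(100, 250), (200, 400), (900, 1000)]
print(merge_intervals(peaks))          # [(100, 400), (900, 1000)]
print(coverage_fraction(peaks, 2000))  # 0.2
```

Merging before summing matters because overlapping peaks from different datasets would otherwise be counted more than once.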
Accurate prediction of functional states of cis-regulatory modules reveals common epigenetic rules in humans and mice

https://doi.org/10.1186/s12915-022-01426-9

Ni, Pengyu; Moe, Joshua; Su, Zhengchang (October 2022, BMC Biology)

Abstract Background: Predicting cis-regulatory modules (CRMs) in a genome and their functional states in various cell/tissue types of the organism are two related, challenging computational tasks. Most current methods attempt to achieve both simultaneously using data of multiple epigenetic marks in a cell/tissue type. Though conceptually attractive, they suffer from high false discovery rates and limited applications. To fill the gaps, we proposed a two-step strategy to first predict a map of CRMs in the genome, and then predict functional states of all the CRMs in various cell/tissue types of the organism. We have recently developed an algorithm for the first step that was able to predict CRMs in a genome more accurately and completely than existing methods by integrating numerous transcription factor ChIP-seq datasets in the organism. Here, we presented machine-learning methods for the second step. Results: We showed that functional states in a cell/tissue type of all the CRMs in the genome could be accurately predicted using data of only 1~4 epigenetic marks by a variety of machine-learning classifiers. Our predictions are substantially more accurate than the best achieved so far. Interestingly, a model trained on a cell/tissue type in humans can accurately predict functional states of CRMs in different cell/tissue types of humans as well as of mice, and vice versa. Therefore, the epigenetic code that defines functional states of CRMs in various cell/tissue types is universal at least in humans and mice. Moreover, we found that from tens to hundreds of thousands of CRMs were active in a human and mouse cell/tissue type, and up to 99.98% of them were reutilized in different cell/tissue types, while as few as 0.02% of them were unique to a cell/tissue type and might define that cell/tissue type. Conclusions: Our two-step approach can accurately predict functional states in any cell/tissue type of all the CRMs in the genome using data of only 1~4 epigenetic marks. Our approach is also more cost-effective than existing methods that typically use data of more epigenetic marks. Our results suggest common epigenetic rules for defining functional states of CRMs in various cell/tissue types in humans and mice.
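
To make the second step concrete, the sketch below trains a tiny logistic-regression classifier that labels a CRM as active or inactive from two epigenetic mark signals. It is an illustration under stated assumptions, not the authors' models: the abstract only says "a variety of machine-learning classifiers", so logistic regression is a stand-in, and the two marks, the signal values and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this CRM.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_active(w, b, x):
    """True if the model assigns probability >= 0.5 to the active state."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5


# Synthetic, made-up signals for two hypothetical marks (mark_a, mark_b).
X = [(0.9, 0.8), (0.8, 0.7), (0.7, 0.9), (0.1, 0.2), (0.2, 0.1), (0.3, 0.2)]
y = [1, 1, 1, 0, 0, 0]  # 1 = active CRM, 0 = inactive CRM

w, b = train_logistic(X, y)
print(predict_active(w, b, (0.85, 0.9)))
print(predict_active(w, b, (0.1, 0.1)))
```

Because the features are per-CRM mark signals rather than anything cell-type specific, a model of this shape trained on one cell/tissue type can, in principle, be applied unchanged to another, which is the kind of transfer the abstract reports.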
PCRMS: a database of predicted cis-regulatory modules and constituent transcription factor binding sites in genomes

https://doi.org/10.1093/database/baac024

Ni, Pengyu; Su, Zhengchang (January 2022, Database)

Abstract More accurate and more complete predictions of cis-regulatory modules (CRMs) and constituent transcription factor (TF) binding sites (TFBSs) in genomes can facilitate characterizing functions of regulatory sequences. Here, we developed a database, Predicted cis-Regulatory Modules (PCRMS) (https://cci-bioinfo.uncc.edu), that stores highly accurate and unprecedentedly complete maps of predicted CRMs and TFBSs in the human and mouse genomes. The web interface allows the user to browse CRMs and TFBSs in an organism, find the closest CRMs to a gene, search CRMs around a gene and find all TFBSs of a TF. PCRMS can be a useful resource for the research community to characterize regulatory genomes. Database URL: https://cci-bioinfo.uncc.edu/
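
A query such as "find the closest CRMs to a gene" amounts to ranking intervals by their distance to a reference position. The sketch below illustrates that idea on in-memory data; it does not use or reproduce the PCRMS web interface, and the record layout, the function names, the half-open coordinates and the toy values are assumptions.

```python
def distance_to_interval(pos, start, end):
    """Distance in bases from `pos` to the half-open interval [start, end)."""
    if pos < start:
        return start - pos
    if pos >= end:
        return pos - (end - 1)  # end - 1 is the last base inside the interval
    return 0


def closest_crms(crms, gene_pos, k=1):
    """Return the k CRM records nearest to `gene_pos` on the same chromosome.

    Each record is a tuple (crm_id, start, end).
    """
    ranked = sorted(crms, key=lambda c: distance_to_interval(gene_pos, c[1], c[2]))
    return ranked[:k]


# Hypothetical CRMs on one chromosome and a hypothetical gene start position.
crms = [("crm1", 1000, 1500), ("crm2", 4000, 4800), ("crm3", 9000, 9500)]
print(closest_crms(crms, gene_pos=5000, k=2))
# [('crm2', 4000, 4800), ('crm1', 1000, 1500)]
```

Sorting every record is fine for a toy list; for genome-scale maps an interval index or a binary search over sorted start positions would be the usual choice.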
Accurate prediction of cis-regulatory modules reveals a prevalent regulatory genome of humans

https://doi.org/10.1093/nargab/lqab052

Ni, Pengyu; Su, Zhengchang (April 2021, NAR Genomics and Bioinformatics)
Abstract cis-regulatory modules (CRMs) formed by clusters of transcription factor (TF) binding sites (TFBSs) are as important as coding sequences in specifying phenotypes of humans. It is essential to categorize all CRMs and constituent TFBSs in the genome. In contrast to most existing methods that predict CRMs in specific cell types using epigenetic marks, we predict a largely cell type agnostic but more comprehensive map of CRMs and constituent TFBSs in the genome by integrating all available TF ChIP-seq datasets. Our method is able to partition the 77.47% of genome regions covered by the 6092 available datasets into a CRM candidate (CRMC) set (56.84%) and a non-CRMC set (43.16%). Intriguingly, the predicted CRMCs are under strong evolutionary constraints, while the non-CRMCs are largely selectively neutral, strongly suggesting that the CRMCs are likely cis-regulatory, while the non-CRMCs are not. Our predicted CRMs are under stronger evolutionary constraints than three state-of-the-art predictions (GeneHancer, EnhancerAtlas and ENCODE phase 3) and substantially outperform them for recalling VISTA enhancers and non-coding ClinVar variants. We estimated that the human genome might encode about 1.47M CRMs and 68M TFBSs, comprising about 55% and 22% of the genome, respectively; for both of which, we predicted 80%. Therefore, the cis-regulatory genome appears to be more prevalent than originally thought.
Deciphering epigenomic code for cell differentiation using deep learning

https://doi.org/10.1186/s12864-019-6072-8

Ni, Pengyu; Su, Zhengchang (December 2019, BMC Genomics)

Prevalent use and evolution of exonic regulatory sequences in the human genome

https://doi.org/10.1002/ntls.20220058

Chen, Jing; Ni, Pengyu; Wu, Siwen; Niu, Meng; Guo, Jun‐tao; Su, Zhengchang (March 2023, Natural Sciences)

Abstract It has long been known that exons can serve as cis‐regulatory sequences, such as enhancers. However, the prevalence of such dual‐use of exons and how they evolve remain elusive. Based on our recently predicted, highly accurate large sets of cis‐regulatory module candidates (CRMCs) and non‐CRMCs in the human genome, we find that exonic transcription factor binding sites (TFBSs) occupy at least a third of the total exon lengths, and 96.7% of genes have exonic TFBSs. Both A/T and C/G in exonic TFBSs are more likely under evolutionary constraints than those in non‐CRMC exons. Exonic TFBSs in codons tend to encode loops rather than more critical helices and strands in protein structures, while exonic TFBSs in untranslated regions (UTRs) tend to avoid positions where known UTR‐related functions are located. Moreover, active exonic TFBSs tend to be in close physical proximity to distal promoters whose genes have elevated transcription levels. These results suggest that exonic TFBSs might be more prevalent than originally thought and likely in dual‐use. We proposed a parsimonious model that well explains the observed evolutionary behaviors of exonic TFBSs as well as how a stretch of codons evolves into a TFBS. Key points: There are more exonic regulatory sequences in the human genome than originally thought. Exonic transcription factor binding sites are more likely under negative selection or positive selection than counterpart nonregulatory sequences. Exonic transcription factor binding sites tend to be located in genome sequences that encode less critical loops in protein structures, or in less critical parts in 5′ and 3′ untranslated regions.
Towards a map of cis-regulatory sequences in the human genome

https://doi.org/10.1093/nar/gky338

Niu, Meng; Tabari, Ehsan; Ni, Pengyu; Su, Zhengchang (May 2018, Nucleic Acids Research)

ProSampler: an ultrafast and accurate motif finder in large ChIP-seq datasets for combinatory motif discovery

https://doi.org/10.1093/bioinformatics/btz290

Li, Yang; Ni, Pengyu; Zhang, Shaoqiang; Li, Guojun; Su, Zhengchang; Berger, Bonnie (ed.) (May 2019, Bioinformatics)

Abstract Motivation: The availability of numerous ChIP-seq datasets for transcription factors (TFs) has provided an unprecedented opportunity to identify all TF binding sites in genomes. However, progress has been hindered by the lack of a highly efficient and accurate tool to find not only the target motifs, but also cooperative motifs in very big datasets. Results: We herein present an ultrafast and accurate motif-finding algorithm, ProSampler, based on a novel numeration method and Gibbs sampler. ProSampler runs orders of magnitude faster than the fastest existing tools while often more accurately identifying motifs of both the target TFs and cooperators. Thus, ProSampler can greatly facilitate the efforts to identify the entire cis-regulatory code in genomes. Availability and implementation: Source code and binaries are freely available for download at https://github.com/zhengchangsulab/prosampler. It was implemented in C++ and supported on Linux, macOS and MS Windows platforms. Supplementary information: Supplementary materials are available at Bioinformatics online.
